# **2018 Exam - Computer Architecture** CS3021

Two Hour Exam

Answer 3 out of 4 Questions

20 marks per question

40 mins per question

i) Briefly describe IA32 and x64 procedure calling conventions. Make sure that you describe the register set, draw a diagram of a typical procedure stack frame and list the volatile registers. In the case of x64 explain the use of shadow space. [6 marks]

### <u>IA32</u>

When calling a function in IA32 all parameters must be pushed onto the stack from right to left. So if calling min(a,b,c) you must first push a, then b, then c. Once parameters have been pushed you must then call and enter the function.

- 1. Push parameters to stack (right to left)
- 2. Call and enter function

Within the function you must initially push the frame pointer to the stack to preserve it for when exiting. If there are any local variables used, space must be allocated for these upon the stack initially. Also, if any non volatile registers are used they must also be pushed to the stack and preserved.

- 3. Preserve frame pointer (**push ebp**)
- 4. Update frame pointer with stack pointer (**mov ebp, esp**)
- 5. Allocate space for local variables (**sub esp, 8** -> space for two local variables)
- 6. Preserve non-volatile registers if used (**push ebx**)

From here the function body can proceed and fulfill its desired functionality. Before exiting the function must perform the following:

- 7. Restore non-volatile registers saved (**pop ebx**)
- 8. De-allocate space for local variables (add esp, 8)
- 9. Restore esp (**mov esp, ebp**)
- 10. Restore frame pointer (**pop ebp**)
- 11. Return (**ret 0**)



Parameters are pushed right to left onto the stack.

Local variables are accessed using a negative offset.

### X64

X64 function calling is quite similar to that of IA32 just with an extended register set and what's known as *shadow space*. Shadow space is 32 bytes of pre-allocated space that every function caller must allocate before calling a function and de-allocate following a return. This way within every function there is  $4 \times 8$  bytes available for parameters. Any additional parameters must be pushed onto the stack. Parameters, like in IA32 are passed from right to left.

- 1. Pass parameters to 1-4 to rcx, rdx, r8 and r9 (**mov rcx, rax**)
- 2. Allocate shadow space (**sub esp, 32**)
- 3. Call function (call min)
- 4. De-allocate shadow space (add esp, 32)
- 5. Result found in rax



# ii) What is the main advantage of the x64 procedure calling convention compared with the IA32 procedure calling convention? [2 marks]

The main advantage of x64 procedure calling vs IA32 is the implementation of shadow space. This allows for 4 registers to be pre-allocated for use with parameters or any other desired usage for every given function. This removes the need for unique stack and frame pointer maintenance and handling throughout functions.

It implements a generic protocol/procedure that every function call must follow and allows for programmers to focus on writing their code and not worrying about stack and frame pointers being handled incorrectly.

# iii) Convert the following code segment into IA32 and x64 assembly.

```
INT g = 4;

INT max(INT a, INT b, INT c) {
    INT v = a;
    if (b > v)
        v = b;
    if (c > v)
        v = c
    return v;
}

INT p(INT i, INT j, INT, k, INT l) {
    return max(max(g, i, j), k, l);
}
```

## **IA32**

```
public
        max
         push ebp
              ebp, esp
              eax, [ebp+8] ; V = a
              ecx, [ebp+12] ; ecx = b
         jle
              max_1
              eax, ecx
max_1:
              ecx, [ebp+16]
         cmp
         jle
              max_2
max_2:
              esp, ebp
              ebp
```

# <u>x64</u>

```
public
          max
min:
                rax, rcx
          mov
          cmp
                rdx, rax
                max_1
          jle
          mov
                rax, rdx
max_1:
                r8, rax
          jle
                max_2
          mov
                rax, r8
max_2:
          ret
                0
```

i) How are virtual addresses converted into physical addresses by a two level page table? Assume a page table structure as per an IA32 based CPU. Draw diagrams to illustrate your answer.

The Memory Management Unit (MMU) converts virtual addresses generated by the CPU into physical addresses in memory using page tables. Address space is divided into fixed sized pages. Each process runs in its own 4GB virtual address space which has pages mapped by the MMU to memory.

Virtual page numbers are converted to physical page numbers using a page-table lookup. The virtual page number is used as an index into a page table stored in physical memory. The page table base register (PTB) contains the physical address of the page table of the current process. 4MB physical memory needed to hold the page table of every process







Using the virtual address and the primary page (primarytable[index1]) table we can find the address of the secondary page table and from this can get the physical address (secondarytable[index2]) of the data in memory.

ii) What is the advantage of using a two level page table compared with a single level page table? Comment on the amount of memory needed for a small process and a maximum sized process using a single and two level page table structure.

To reduce the size of the page table structure an n-level look up table can be used. This means that the larger the process, the more memory is needed for its tables. However, the smaller the process the less memory that is needed.

Each process has one 4MB primary page table and multiple 4KB secondary, tertiary... page tables. Secondary page tables are created on demand as a process' size exceeds that of the capability of a single secondary table and needs more address spaces. This way, small processes never over-use memory and memory is utilized efficiently for both small processes and maximum sized processes.

A process in a 32 bit system can theoretically use each of the possible 1 million 20 bit page locations in its table if necessary. A single 32-bit page table therefore would need 4MB of RAM (32bit = 4 bytes, 4 bytes \* 1million rows = 4MB). A two level page table is far more memory usage efficient as the virtual page address is split into two addresses (index1 - primary table index points to secondary page table AND index2 - secondary table index points to address in memory). This means that each primary and secondary table is now 1024 lines long. The same capabilities apply as the 1024 primary table addresses can map to 1024 secondary table addresses (1024^2 = 1 million). More secondary tables after the first one only need to be created as they are required thus saving space in RAM.

|                    | min process (i)                                                                                              | max process (ii)                                                      |
|--------------------|--------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| 1 level page table | 4MB                                                                                                          | 4MB                                                                   |
| 2 level page table | 4KB + 4KB<br>(1 primary level table always<br>present, smallest process will<br>need at least one secondary) | 4KB + 4MB<br>(1 primary level table + 1024<br>secondary level tables) |

ii) Draw a diagram showing the structure of an IA32 2 level kernel page table which maps the first 16MB of the kernels virtual address space starting at address 0x00000000 onto 8MB of memory located at physical address 0x10000000 and 8MB of memory located at physical address 0x20000000.

 $\underline{\textbf{4 x Primary Tables}}$  →  $4 \times [1024 \text{ Secondary Table Mappings } \times 4 \text{KB each} = 4 \text{MB each}]$  → 16MB

i) With the aid of a diagram explain how a cache organisation can be characterised by three constants, LKN. Explain in detail, how a data item is accessed in an LKN cache. Why does a cache normally reduce the effective memory access time? [6 marks]



- N sets of cache lines/tages
- K directories
- L bytes per cache line (L=16  $\rightarrow$  4 x 32bit words per cache line)

When referencing/searching a K-Way cache, each cache address is mapped to a particular set (0-N).



# [tag] [set #] [offset]

If an address maps to set 1, the set 1 tags of all K directories are compared with the incoming address tag simultaneously (fully associative).

If a match is found (hit), corresponding data returned offset within cache line. The K data lines in the set are also fully associative and accessed concurrently. If a match is not found (miss), read data from memory, place in cache line within set and update corresponding

cache tag (choice of K positions). This replacement depends on the cache line replacement strategy - Least Recently Used (LRU)

ii) Compute the number of hits and misses if the following list of hexadecimal addresses is applied to (1) a 64 byte direct mapped cache with 16 bytes per line and (2) a 64 byte fully associative cache with 16 bytes per line.

**(1)**  $L \times K \times N = 64$ 

L = 16

K = 1

N = 4

# Can ignore first 00 since all the same and the last 0 since don't care about offset

| HEX    | BINARY                               | TAG | <u>SET</u> | <u>OFFSET</u> | HIT/MISS | CACHED<br>AT |
|--------|--------------------------------------|-----|------------|---------------|----------|--------------|
| 0x0000 | <mark>0000</mark> 0000               | 00  | 00         | -             | MISS     | N = 0        |
| 0x0010 | <mark>00</mark> 01 <del>0000</del>   | 00  | 01         | -             | MISS     | N = 1        |
| 0x0020 | <mark>0010</mark> 0000               | 00  | 10         | -             | MISS     | N = 2        |
| 0x0030 | <mark>00</mark> 11 <del>0000</del>   | 00  | 11         | -             | MISS     | N = 3        |
| 0x0034 | <mark>00</mark> 11 <del>0100</del>   | 00  | 11         | -             | HIT      | N = 3        |
| 0x0020 | <mark>0010</mark> 0000               | 00  | 10         | -             | HIT      | N = 2        |
| 0x0010 | <mark>00</mark> 01 <del>0000</del>   | 00  | 01         | -             | HIT      | N = 1        |
| 0x000C | <mark>00</mark> 00 <del>1100</del>   | 00  | 00         | -             | HIT      | N = 0        |
| 0x0050 | <mark>01</mark> 01 <del>0000</del>   | 01  | 01         | -             | MISS     | N = 1        |
| 0x0040 | <mark>0100</mark> 0000               | 01  | 00         | -             | MISS     | N = 0        |
| 0x002C | <mark>00</mark> 10 <mark>1100</mark> | 00  | 10         | -             | HIT      | N = 2        |
| 0x0008 | <mark>00</mark> 00 <del>1000</del>   | 00  | 00         | -             | MISS     | N = 0        |
| 0x0030 | <mark>00</mark> 11 <del>0000</del>   | 00  | 11         | -             | HIT      | N = 3        |
| 0x0020 | <mark>00</mark> 10 <del>0000</del>   | 00  | 10         | -             | HIT      | N = 2        |
| 0x0010 | <mark>00</mark> 01 <del>0000</del>   | 00  | 01         | -             | MISS     | N = 1        |
| 0x0000 | <mark>0000</mark> 0000               | 00  | 00         | -             | HIT      |              |

**(2)**  $L \times K \times N = 64$ 

L = 16

K = 4

N = 1

# Since N = 1, set always 0

| <u>HEX</u> | BINARY                             | TAG  | <u>SET</u> | OFFSET | HIT/MISS | REPLACING<br>LRU |
|------------|------------------------------------|------|------------|--------|----------|------------------|
| 0x0000     | <mark>0000</mark> 0000             | 0000 | 0          | -      | MISS     |                  |
| 0x0010     | 0001 <mark>0000</mark>             | 0001 | 0          | -      | MISS     |                  |
| 0x0020     | 0010 <mark>0000</mark>             | 0010 | 0          | -      | MISS     |                  |
| 0x0030     | 0011 <del>0000</del>               | 0011 | 0          | -      | MISS     |                  |
| 0x0034     | <mark>0011</mark> <del>0100</del>  | 0011 | 0          | -      | HIT      |                  |
| 0x0020     | <mark>0010</mark> <del>0000</del>  | 0010 | 0          | -      | HIT      |                  |
| 0x0010     | <mark>0001</mark> <del>0000</del>  | 0001 | 0          | -      | HIT      |                  |
| 0x000C     | <mark>0000</mark> <del>1100</del>  | 0000 | 0          | -      | HIT      |                  |
| 0x0050     | 0101 <mark>0000</mark>             | 0101 | 0          | -      | MISS     | 0011             |
| 0x0040     | 0100 <mark>0000</mark>             | 0100 | 0          | -      | MISS     | 0010             |
| 0x002C     | <mark>0010</mark> <del>1100</del>  | 0010 | 0          | -      | MISS     | 0001             |
| 0x0008     | <mark>0000</mark> 1 <del>000</del> | 0000 | 0          | -      | HIT      |                  |
| 0x0030     | <mark>0011</mark> <del>0000</del>  | 0011 | 0          | -      | MISS     | 0101             |
| 0x0020     | <mark>0010</mark> <del>0000</del>  | 0010 | 0          | -      | HIT      |                  |
| 0x0010     | <mark>0001</mark> <del>0000</del>  | 0001 | 0          | -      | MISS     | 0100             |
| 0x0000     | 0000                               | 0000 | 0          | -      | HIT      |                  |

LRU =  $\frac{0000}{0001}$  /  $\frac{0001}{0001}$  /  $\frac{0010}{0001}$  /  $\frac{0000}{0001}$  /  $\frac{0000}{0000}$  /  $\frac{0101}{0000}$  /  $\frac{0000}{0001}$  /  $\frac{0000}{0001}$  /  $\frac{0000}{0001}$ 

iii) Would you expect an associative cache of size N and line size L to always have a higher hit rate than a direct mapped cache of size N and line size L. Explain the reasoning behind your answer.

### Consider two caches:

- K = 4, N = 1, L = 16 [64 byte fully associative]
- K = 1, N = 4, L = 16 [64 byte direct mapped]

# L \* K \* N all equal

Consider the following address sequences repeatedly querying cache:

[a], [a+16], [a+32], [a+48], [a+64] 
$$\rightarrow$$
 5 addresses

Our caches can contain 4 addresses, sequence comprises of 4



<u>Fully Associative:</u> Only 4 addresses can fit in the 4-way cache so, due to the LRU replacement policy, every access will be a miss.

<u>Direct Mapped:</u> Since ONLY addresses [a] and [a+64] will conflict each other as they map to the same set, there will be 2 misses and 3 hits per cycle of 5 addresses.

i) What is the cache coherency problem? Briefly explain the states and operation of the Write Once cache coherency protocol.

The cache coherency problem is an issue regarding cache's and physical memory where a CPU might make changes to a value in cache. This change must be propagated to both physical memory and any other cache's that have access to this same shared address/value.

### **Write-Once Protocol**

- Uses write-back caches to reduce bus traffic.
- Each cache line can be in one of 4 states; **Valid, Invalid, Reserved, Dirty**
- **Invalid:** Wrong copy, fetch from memory.
- **Valid:** Cache line valid, present in one or more caches
- **Reserved:** Cache line <u>only in local cache</u>, send copy to memory
- **<u>Dirty:</u>** Cache line <u>only in local cache</u>, memory has not been updated yet
- <u>Valid</u> → <u>Reserved</u> when written to for first time (write-back to memory).
- Reserved → Dirty when written to multiple times, enters a write-back cycle and marked as dirty since now the only valid copy in the system.
- Caches monitor bus traffic, if read from a RESERVED copy, mark as valid
- If a read observed from DIRTY copy, cut read operation and supply local copy, also now write data to memory and mark as valid.

ii) Describe the changes that occur in a three CPU multiprocessor system where each CPU has its own cache.

 $\underline{t} = \underline{0} \rightarrow \text{CPU 0}$  attempts to read a0. This will initially be empty and must be fetched from memory and loaded into CPU 0 set 0. It is initially marked as valid.

 $\underline{t} = \underline{1} \rightarrow \text{CPU 0}$  attempts to read a0. Since a0 is already in the cache and is marked as valid the correct value is read from the CPU without needing to access memory.

- $\underline{t=2} \rightarrow \text{CPU 1}$  attempts to read a0. This will be initially empty and must be fetched from memory and loaded into CPU 1 set 0. It is initially marked as valid.
- $\underline{t=3}$  → CPU 1 attempts to write to a0. This change will be propagated to memory over the bus and within CPU 1 set0 will be marked as Reserved. Since a0 has now changed in memory, CPU 0's value of a0 is now marked as invalid
- $\underline{t=4}$  → CPU 2 attempts to read a0. This will initially be empty and must be fetched from memory and loaded into CPU 2 set 0. It is initially marked as valid. **CPU 1's value of a0 will** now also be marked as valid since it is shared.
- $\underline{t=5} \rightarrow \text{CPU 2}$  attempts to write to a0. This change will be propagated to memory over the bus and within CPU 2 set0 will be marked as Reserved. Since a0 has now changed in memory, CPU 0 and CPU 1's value of a0 is now marked as invalid.
- $\underline{t=6}$  → CPU 0 attempts to write to a0. Since a0 is invalid in this cache the value of a0 must be read from memory. It is fetched from memory into CPU 0 and marked as valid. CPU 2's value of a0 is now also marked as valid since it is now shared. CPU 0 then proceeds to write to a0. This change is propagated back to memory and a0 is marked as invalid in both CPU 1 and 2. It is now marked as Reserved in CPU 0.
- $\underline{t=7} \rightarrow \text{CPU 0}$  now attempts to read a0. Since a0 is reserved in this cache the value of a0 is read directly from the cache and no traffic is passed along the bus.
- $\underline{t=8} \rightarrow \text{CPU 1}$  attempts to read a0. Since a0 is invalid in this cache the value of a0 must be read from memory. It is fetched from memory into CPU 1 and marked as valid. CPU 0's value of a0 is now also marked as valid since it is now shared.
- $\underline{t=9} \rightarrow \text{CPU 0}$  attempts to write to a0. Since a0 is valid, no data is fetched from memory and the write is performed immediately. This change is propagated to memory and the status of a0 in CPU 0 is set to reserved. CPU 1's value of a0 is now marked as invalid.
- $\underline{t = 10} \rightarrow \text{CPU 1}$  attempts to write to a0. Since a0 is invalid, the up to date value is first fetched from memory. When this is fetched it is marked as valid and so is CPU 0's copy of a0. After this the write is performed and the change is propagated back to memory. CPU 1's value of a0 is marked as reserved and CPU 0's value is marked as invalid.

 $\underline{t=11} \rightarrow \text{CPU 1}$  again attempts to write to a0. Since a0 is reserved the write is performed in place and no data is fetched or transferred along the bus. The value of a0 is now marked as dirty in CPU 1 as it is the only cache with a valid copy of a0. If any other CPU attempted to then read a0, CPU 1 would have to interrupt the read and pass the new value of a0 to both this CPU and memory.

i) What is the cache coherency problem? Briefly explain the states and operation of the MESI cache coherency protocol.

The cache coherency problem is an issue regarding cache's and physical memory where a CPU might make changes to a value in cache. This change must be propagated to both physical memory and any other cache's that have access to this same shared address/value.

### **MESI Protocol (Vivio)**

- Very similar to write-once however, like Firefly has a SHARED bus signal
- This way cache line can enter cache and be placed directly in shared/exclusive state
- <u>M</u>odified → Dirty
- **E**xclusive → Reserved
- **S**hared → Valid
- Invalid → Invalid

What is the difference between MESI and Write-Once? Cache line may enter cache and be placed directly in the Exclusive state. Write-once and write-through cycles no longer necessary if cache line is exclusive.

ii) Describe the changes that occur in a three CPU multiprocessor system where each CPU has its own cache.

 $\underline{t=0} \rightarrow \text{CPU 0}$  attempts to read a0. Since it is not present in its register set it must fetch it from memory. It is then placed in the exclusive state.

- $\underline{t=1} \rightarrow \text{CPU 0}$  attempts to read a0 again. Since it is already present in its register set and is exclusive it is read directly from here and not fetched from memory.
- $\underline{t=2}$  → CPU 1 attempts to read a0. Since it is not present in its register set it must be fetched from memory. CPU 0 sees this read and pings the shared bus and now a0 is marked as SHARED in both CPU 1 and CPU 0.
- $\underline{t=3} \rightarrow \text{CPU 1}$  attempts to write to a0. Since it is already present no data is fetched from memory. A0 is updated locally and since the value is shared this change is propagated to memory and CPU 0's value of a0 is marked as invalid and CPU 1's value of a0 is marked as exclusive.
- $\underline{t} = \underline{4}$  → CPU 2 attempts to read a0. Since it is not in its data set's it is fetched from memory and stored in its cache. The value of a0 is then marked as SHARED in both CPU 1 and CPU 2.
- $\underline{t=5} \rightarrow \text{CPU 2}$  attempts to write to a0. Since it is already present no data is fetched from memory. A0 is updated locally and since the value is shared this change is propagated to memory and CPU 1's value of a0 is marked as invalid and CPU 2's value of a0 is marked as exclusive.
- $\underline{t=6}$  → CPU 0 attempts to write a0. Since its value is marked as INVALID it must first be fetched from memory. It is then written to in CPU 0's cache and updated in memory. The value of a0 is marked as EXCLUSIVE in CPU 0 and CPU 1 and 2 are INVALID.
- $\underline{t} = 7 \rightarrow \text{CPU 0}$  attempts to read a0. Since its value is EXCLUSIVE the read is performed directly from the cache itself and no data is fetched from memory.
- $\underline{t=8} \rightarrow \text{CPU 1}$  attempts to read a0. Since its value is INVALID the read is performed from memory. CPU 0 also sees this read occurring and changes its state of a0 to SHARED.
- $\underline{t=9} \rightarrow \text{CPU 1}$  attempts to write to a0. This write is performed and the change propagated to memory. It is then marked as EXCLUSIVE and a0's flag within CPU 1 is changed to INVALID.

 $\underline{t = 10} \rightarrow \text{CPU 1}$  attempts to write to a0 again. Since exclusive, this write is only applied locally and no data is transferred to memory. The flag is also changed to MODIFIED.

 $\underline{t=11} \rightarrow \text{CPU 2}$  attempts to read a0. Since INVALID it attempts to fetch from memory. CPU 1 see's this read occurring on a MODIFIED value of his and intervenes. CPU 1 supplies the data to CPU 2 and also propagates the change to memory. They are then both marked as SHARED.